Random model for RNA interference yields scale free network 







Duygu Balcan 1 and Ayşe Erzan 1,2
1 Department of Physics, Faculty of Sciences and Letters
Istanbul Technical University, Maslak 34469, Istanbul, Turkey
2 Gürsey Institute, P.O.B. 6, Çengelköy, 34680 Istanbul, Turkey
(Dated: February 9, 2008) 

We introduce a random bit-string model of post-transcriptional genetic regulation based on
sequence matching. The model spontaneously yields a scale free network whose out-degree
distribution obeys power law scaling with $\gamma = -1$ and also exhibits log-periodic behaviour.
The in-degree distribution is much narrower, and exhibits a pronounced peak followed by a
Gaussian distribution. The network is of the smallest world type, with the average minimum
path length independent of the size of the network, as long as the network consists of one
giant cluster. The percolation threshold depends on the system size.

I. INTRODUCTION

Although biology on the whole is a "knowledge based" discipline, with a strong traditional
bias towards a reverse engineering approach to evolution, and a "form follows function"
approach, prominent workers in the field have stressed that natural selection could very
well have operated on complex structures already present in the prebiotic world. Eigen [1]
pointed out that non-linear out-of-equilibrium systems were capable of amplifying random
fluctuations participating in feed-back loops, while Kauffman [2, 3] introduced random
Boolean networks as null models of the genetic regulatory mechanism, and showed that they
could spontaneously give rise to a great degree of complexity. Meanwhile, within the last
two decades, we have learned a great deal about Self-Organized Criticality [4], namely the
spontaneous emergence of scale free structures in open systems far from equilibrium, driven
by conserved fluxes. Thus, within a statistical mechanics context, it is much more natural
to consider an ensemble of different states among which complex structures arise
spontaneously. Evolutionary processes can be regarded as inducing dynamics on the
distribution of states in this high dimensional phase space.

In this paper we introduce a null model for gene interactions resulting in the regulation
of gene expression. This model is based on sequence matching on a random bit-string
representation of the chromosome and is therefore radically different from the random
Boolean networks which have been considered before. It can be considered as an
out-of-equilibrium system subject to a constant mutation rate, achieving a steady state
invariant under further mutations.

We present simulation results which display many qualitative features of gene regulation
networks found in nature. We find that the in- and out-degree distributions are
qualitatively different from each other, the out-degree distribution exhibiting power law
decay with $n(k_{\rm out}) \sim k_{\rm out}^{\gamma}$, with $\gamma = -1$, and log-periodic
oscillations for relatively small $k$. The in-degree distribution, on the other hand, is
much more localized.

The network has smallest world characteristics, with the cluster diameter, or the average
minimum path length, being essentially independent of the cluster size as long as the
network consists of one cluster.

FIG. 1: The random sequence of symbols representing a chromosome. The "2" represents a
start or stop sign for a gene, and the 0's and 1's code the genes. The arrows indicate
that one gene sequence is embedded in another one.

The average clustering coefficients for the in- and out-bonds have been calculated as
0.648 and 0.034, respectively. It is interesting to note that the clustering coefficients
for the in-bonds behave much like those of classical random networks, while the out-bonds
have a clustering coefficient typical of scale free networks.

In Section II, we will first motivate and then define the model. In Section III, we will
present our simulation results. In Section IV, we consider a toy model which mimics some
of the features we observe in our simulations, and indicate ways in which this toy model
may be improved in order to provide insights into how our basic model works. Section V
contains our conclusions and a discussion.



II. MODELLING RNA INTERFERENCE 



A. Genomic regulatory networks 



Protein networks, which are an important component of transcriptional gene regulation
networks [14, 15], display a scale free structure, with the out-degree distribution
characterised by a power law $n(k) \sim k^{\gamma}$, with the exponent $\gamma \simeq -2.5$.
Gene regulation networks actually operate at many different levels [16].
Post-transcriptional gene regulation, or RNA interference [17], is a mechanism where RNA
strands may bind directly to complementary segments on messenger RNA destined to be
translated into some protein, thereby suppressing the production of this protein. Although
we are not aware of a scaling analysis of post-transcriptional gene interaction, it would
be a fair guess to assume that interaction complexes of various sizes may arise in this
type of interaction as well.

In all protein-protein or intra-genomic interactions, as well as the transcription and
translation mechanism itself, essential lock-and-key mechanisms are in operation. For
normal translation to take place, rRNA in the ribosomes must be able to recognize and
match the different amino acids and the corresponding three-letter codon on the mRNA. In
this, the rRNA is aided by the intermediary tRNA, which assumes a very specific three
dimensional structure depending on the amino acid to which it binds. In transcriptional
gene regulation, certain proteins known as transcription factors (TF) must first be
synthesized, and then bind onto specific "promoter" sites preceding the coding part of a
gene, in order that the RNA polymerase might start transcribing the DNA code into the
mRNA. Conversely, the binding of other proteins onto the same promoter sites may block the
binding of the TF, and thereby block the production of the mRNA. In both cases, such
binding presupposes steric and chemical specificity. In RNA interference, the short
interfering RNA (siRNA) strands bind onto complementary sequences on the mRNA, via
Watson-Crick base pairing [18, 19].

Although it seems, from the above, as if there is a great diversity in these lock-and-key
mechanisms, it should be realised that they all eventually match linear codes, even though
this matching may take a few intermediary steps. The three dimensional structures (so
called secondary structures) which come into play either in tRNA or the TF are actually
determined by either the sequence of ribonucleic acids on the tRNA, or the sequence of
amino acids (primary structure) of the protein constituting the TF. Clearly the simplest
is direct Watson-Crick base pairing between complementary sequences, and it is this
latter, as it appears in RNA interference, which we will take to be the paradigm for our
model. In fact we will further simplify the matching condition to consist of the identity
relation rather than complementarity. Since both are one-to-one, we believe that this
should not change our results.

B. The Model 

The model is defined as follows. We postulate a "chromosome" to consist of a sequence of
fixed length L, of independently and identically distributed random numbers with the
probability distribution

$$P(x) = p\,\delta(x-2) + \frac{1-p}{2}\left[\delta(x-1) + \delta(x)\right] . \qquad (1)$$

We define a "gene" to consist of a sequence of 0's and 1's situated between the ith and
(i+1)st occurrence of the symbol "2," and we will denote the ith gene by

$$G_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,\ell_i}\} , \qquad i = 1, \ldots, s \qquad (2)$$

where $x_{i,\mu} \neq 2$, $\mu = 1, \ldots, \ell_i$, and $\ell_i$ is the length of the ith gene.
We have used periodic boundary conditions, but one could just as well agree to end the
chromosome always with a "2" at the (L+1)st site. Then

$$\sum_{i=1}^{s} \ell_i = L - s , \qquad (3)$$

where s is the number of genes (the number of times the symbol 2 appears) on the
chromosome. Let $n_\ell$ be the number of genes of length $\ell$. It obeys the sum rule

$$\sum_{\ell=0}^{L-s} n_\ell = s . \qquad (4)$$

For a given number s of genes, the number of possible realisations of a given set
$\{n_\ell\}$ is

$$\frac{s!}{\prod_{\ell=0}^{L-s} n_\ell !} , \qquad (5)$$

and the most probable distribution $n(\ell)$ can easily be found by using Lagrange
multipliers, to be $n(\ell) = L p^2 (1-p)^{\ell}$, in the limit of large L.

FIG. 2: Different kinds of vertices allowed on our directed random network. While the in-
or out-neighbors of a vertex may or may not be connected to each other, in the case of the
third configuration, with a pair of mixed bonds, the neighboring nodes are necessarily
connected as shown, due to transitivity.

With these definitions we obtain a sequence of genes, separated from each other by the
symbol 2 (see Fig. 1). In case there is more than one consecutive 2, they will be
considered to bracket null genes. Clearly, for large L, the number of non-null genes, N,
will fluctuate around $N = Lp - Lp^2$.
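
The construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the simulations reported here: the function names `random_chromosome` and `extract_genes` are hypothetical, and the chromosome is closed with a final "2" (the alternative to periodic boundaries mentioned in the text).

```python
import random

def random_chromosome(L, p, rng):
    """Draw L i.i.d. symbols: 2 with probability p, otherwise 0 or 1 with equal probability."""
    return [2 if rng.random() < p else rng.randrange(2) for _ in range(L)]

def extract_genes(chromosome):
    """Split the chromosome at every 2 and return the non-null genes as strings of 0s and 1s.

    Assumption: the chromosome is treated as ending with a 2 at site L+1
    instead of using periodic boundary conditions.
    """
    genes, current = [], []
    for x in chromosome + [2]:
        if x == 2:
            if current:  # consecutive 2's bracket null genes, which are dropped
                genes.append("".join(map(str, current)))
            current = []
        else:
            current.append(x)
    return genes

rng = random.Random(0)
L, p = 5000, 0.1
genes = extract_genes(random_chromosome(L, p, rng))
# Number of non-null genes next to the large-L estimate Lp(1-p)
print(len(genes), L * p * (1 - p))
```

For a single realization the gene count fluctuates around $Lp(1-p)$; averaging over many seeds would be needed to compare with $n(\ell) = Lp^2(1-p)^{\ell}$.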

Each of the non-null genes constitutes a node in our gene regulation network. The
interactions do not depend on the proximity of the genes along the chromosome. We define
the adjacency matrix $w_{ij}$ by the matching condition such that

$$w_{ij} = \begin{cases} 1 & \text{if } G_i \subset G_j \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$

By $G_i \subset G_j$ we mean $x_{i,\mu} = x_{j,\mu+\nu}$ for $\mu = 1, \ldots, \ell_i$, for at
least one integer $\nu$ such that $0 \le \nu \le \ell_j - \ell_i$. Note that $w_{ij} = 1$
implies that $\ell_i \le \ell_j$; in the case of equality, $w_{ij} = 1$ if and only if the
two sequences $G_i$ and $G_j$ are identical, i.e., congruent. Thus, two genes are said to
interact if the sequence $G_i$ occurs at least once as an unbroken subsequence of $G_j$,
i.e., if one can be embedded in the other at least once.
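
As a minimal sketch of the matching condition, genes can be held as Python strings of 0s and 1s, so that "occurs as an unbroken subsequence" becomes a substring test. The function name `adjacency` and the choice to exclude self-loops (the diagonal) are assumptions made for this illustration; the text does not specify how the diagonal is treated.

```python
def adjacency(genes):
    """w[i][j] = 1 if gene i occurs as a contiguous substring of gene j, else 0.

    Assumption: self-loops (i == j) are excluded.
    """
    n = len(genes)
    return [[int(i != j and genes[i] in genes[j]) for j in range(n)] for i in range(n)]

genes = ["0", "01", "110", "0110"]
w = adjacency(genes)
n = len(genes)
k_out = [sum(w[i]) for i in range(n)]                       # bonds from gene i into longer genes
k_in = [sum(w[i][j] for i in range(n)) for j in range(n)]   # bonds arriving at gene j
print(k_out, k_in)
```

In this small example the shortest gene has the largest out-degree and the longest gene the largest in-degree, which is the asymmetry the model relies on.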

Clearly this adjacency (or connectivity) matrix is directed. Moreover connectivity is
"transitive" in the sense that $w_{ij} = w_{jk} = 1$ implies that $w_{ik} = 1$. The latter
condition gives rise to a preferential attachment of incoming bonds to large genes, while
small genes have an enhanced distribution of out-bonds. However, we will see in the next
section that the degree distribution is scale free for the out-bonds, but not for the
in-bonds.

A simple argument tells us that our network is of the smallest world type. If we take the
out-bonds, we see that, due to transitivity, any two successive edges linking, say,
vertices $ij$ and $jk$ necessarily imply the existence of another directed edge $ik$. Thus
$l^{(o)} = 1$, as long as the network consists of a single cluster. The in-bonds follow
the same argument, giving $l^{(i)} = 1$ for a network with one giant cluster.

The question of whether a giant cluster always exists, or whether we can identify the
analog of a "percolation threshold", we will address in the next section, where we will
also report numerical results for the minimum path length for undirected bonds (i.e.,
ignoring the directionality of the edges). We expect that the qualitative behaviour of the
network should not depend on $p$ as long as $p$ is bounded away from the percolation
threshold. Requiring the number of vertices, N, to be larger than unity, i.e.,
$N = Lp - Lp^2 = Lp(1-p) > 1$, gives $p(1-p) > 1/L$ (however, this lower limit turns out not
to be tight enough, i.e., $p_c > 1/L$). The average gene size, again in the limit of large
L, is $\langle \ell \rangle = (1-p)/p$. The obvious requirement for a non-trivial network,
that $\langle \ell \rangle > 1$, yields $p < 1/2$. We therefore expect to find scaling
behaviour, if any, for $p_c < p < 1/2$.

FIG. 3: The out-degree distribution for p = 0.1 and L = 5000.

FIG. 4: The out-degree distribution for p = 0.05 and L = 15000.

The elements $w_{ij}$ of the connectivity matrix are equal to unity with probabilities

$$P_{ij} = P(\ell_i, \ell_j) , \qquad (7)$$

which depend on the lengths $(\ell_i, \ell_j)$ through $\ell_i$ and
$\nu = \ell_j - \ell_i$. It is trivial to see that

$$P(\ell,\ell) = \frac{1}{2^{\ell}} . \qquad (8)$$

We may make a mean field theory type of approximation to $P(\ell, \ell+\nu)$ by neglecting
the correlations between overlapping subsequences of $G_j$, and obtain

$$P(\ell,\ell+\nu) \simeq \frac{(1+\nu)}{2^{\ell}} . \qquad (9)$$

In the Appendix we have computed $P(\ell, \ell+\nu)$ explicitly for $\nu = 1, 2$. However,
so far it has not been possible to extract the form of the degree distribution
analytically.
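
For short genes the embedding probability can be checked by brute force. The sketch below enumerates all pairs of binary strings of lengths $\ell$ and $\ell+\nu$ and counts embeddings; the function names are hypothetical. The comparison function treats the $1+\nu$ windows of the longer gene as independent, which is one way to read "neglecting the correlations between overlapping subsequences"; its leading term for small $2^{-\ell}$ is $(1+\nu)2^{-\ell}$.

```python
from itertools import product

def embed_probability(l, nu):
    """Exact P(l, l+nu) by enumerating all pairs of binary strings (feasible for small l only)."""
    hits = total = 0
    for gi in product("01", repeat=l):
        short = "".join(gi)
        for gj in product("01", repeat=l + nu):
            total += 1
            hits += short in "".join(gj)  # True counts as 1
    return hits / total

def independent_windows(l, nu):
    """Approximation treating the 1+nu windows as independent: 1 - (1 - 2^-l)^(1+nu)."""
    return 1.0 - (1.0 - 2.0 ** -l) ** (1 + nu)

for l, nu in [(2, 0), (2, 1), (3, 1), (3, 2)]:
    print(l, nu, embed_probability(l, nu), independent_windows(l, nu))
```

The enumeration grows as $2^{2\ell+\nu}$, so it is only a sanity check for small lengths, not a route to the full degree distribution.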



The directed graph gives rise to different kinds of vertices, shown in Fig. 2, which allow
different classes of clustering coefficients C. Let $k_{\rm out}(i)$, $k_{\rm in}(i)$,
$k(i)$ be, respectively, the out-degree, in-degree and the total degree of the vertex i.
The clustering coefficient at a given vertex i is defined as

$$C_i = \frac{2E(i)}{k(i)[k(i)-1]} , \qquad (10)$$

where $E(i)$ is the number of edges connecting the nearest neighbors of i. Thus it is the
number of pairs of nearest neighbors directly connected to each other, normalized by the
largest number of such connections possible. We may extend this concept to directed graphs
and define

$$C_{\rm out}(i) = \frac{2E_{\rm out}(i)}{k_{\rm out}(i)[k_{\rm out}(i)-1]} , \qquad (11)$$

and similarly for $C_{\rm in}$, where $E_{\rm out}(i)$ (respectively $E_{\rm in}(i)$) is the
number of edges connecting out (in) nearest neighbors of i. Note that any pair of incoming
and outgoing bonds passing through i necessarily defines a triangle, due to transitivity,
as shown in Fig. 2. Thus, the clustering coefficient may be conveniently decomposed as

$$C_i = \frac{k_{\rm in}(i)\,k_{\rm out}(i) + E_{\rm in}(i) + E_{\rm out}(i)}{k(i)[k(i)-1]/2} . \qquad (12)$$
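
A minimal sketch of these clustering coefficients for a directed 0/1 adjacency matrix is given below. The function name `clustering` and the conventions used are assumptions for illustration: a pair of neighbors counts as connected if an edge exists in either direction, the total neighborhood is the union of in- and out-neighbors, and vertices with fewer than two neighbors are assigned zero.

```python
from itertools import combinations

def clustering(w, i, kind="total"):
    """Clustering coefficient of vertex i over its 'in', 'out' or 'total' neighborhood."""
    n = len(w)
    outs = {j for j in range(n) if w[i][j]}
    ins = {j for j in range(n) if w[j][i]}
    nbrs = {"out": outs, "in": ins, "total": outs | ins}[kind]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: undefined cases are set to zero
    linked = sum(1 for a, b in combinations(sorted(nbrs), 2) if w[a][b] or w[b][a])
    return 2.0 * linked / (k * (k - 1))

# Adjacency matrix of the genes "0", "01", "110", "0110" under substring matching
w = [[0, 1, 1, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
print(clustering(w, 0, "out"), clustering(w, 3, "in"), clustering(w, 1, "total"))
```

Vertex 1 has one in-neighbor and one out-neighbor, and these are linked by transitivity, so its total clustering coefficient is 1 in this example.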



This is a null model in the sense that no assumptions 
have been made as to the fitness of any particular type 
of interaction; the resulting interactions depend only on 
the random sequences coded in the genes and on the dis- 
tribution of gene lengths. This random network provides 
the "tabula rasa" on which we assume natural selection 
will subsequently act. As such, it is of great interest to 
determine the properties of the null network, which turn 
out to be highly non-trivial. This is the task to which we 
turn in the next section. 



III. SIMULATION RESULTS 

To characterize the network defined by our model, we generated random chromosomes as
defined above. The statistical properties of the network obtained from the totally random
chromosome were checked to be invariant under a constant mutational load, with a mutation
probability of 0.01. A mutation is effected as follows. If the symbol $x$ occupying the
site to be mutated happens to be a 0 or 1, then it is flipped, i.e., we set
$x \to (x+1) \bmod 2$. If $x = 2$, then it exchanges places with either its right or left
nearest neighbor, with equal probability. It should be noted that the first case
corresponds to a substitutional mutation, whereas the second corresponds to a shifting of
the position of the start sign, which shifts the reading frame of the gene. The latter
gives rise to far-reaching modifications, since the three-letter codes corresponding to
the different amino acids will be completely modified along the whole gene if the reading
frame is shifted.
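
The mutation rule can be sketched as follows. This is an illustrative reading, not the authors' code: the function name `mutate` is hypothetical, the mutation probability is interpreted as a per-site probability applied in one sequential sweep, and periodic boundaries are used for the neighbor exchange.

```python
import random

def mutate(chromosome, rate, rng):
    """Return a mutated copy; each site is selected independently with probability `rate`."""
    c = list(chromosome)
    L = len(c)
    for site in range(L):
        if rng.random() >= rate:
            continue
        if c[site] in (0, 1):
            c[site] = (c[site] + 1) % 2            # substitution: flip the bit
        else:
            nb = (site + rng.choice((-1, 1))) % L  # left or right neighbor, periodic boundaries
            c[site], c[nb] = c[nb], c[site]        # the 2 exchanges places with its neighbor
    return c

rng = random.Random(0)
chromosome = [0, 1, 2, 0, 0, 2, 1, 1, 0, 2] * 10
mutated = mutate(chromosome, 0.01, rng)
print(sum(a != b for a, b in zip(chromosome, mutated)))  # number of sites that changed
```

Both moves conserve the chromosome length, and the exchange move conserves the number of 2's, so the number of genes (including null genes) is unchanged by a sweep.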

The most remarkable property of the random network generated in this way is that it has a
scale free out-degree distribution, and a qualitatively different in-degree distribution
which is much less broad, with a pronounced narrow peak followed by a Gaussian peak. These
qualitative behaviours match the results found on protein and genomic networks to a
surprising extent.

We display in Figs. 3 and 4 the out-degree distribution $n(k_{\rm out})$ for two sets of
parameters, namely L = 5000, p = 0.1 and L = 15000, p = 0.05, with the number of genes
fluctuating around N = 450 and N = 712, respectively. The data have been averaged over 500
independent realizations. The distributions, which are strongly log-periodic, with a power
law envelope $n_m(k_{\rm out}) \sim k_{\rm out}^{\gamma'}$, have the same characteristic
behaviour for the two cases. The fits to the envelope of the peaks are shown in Figs. 5
and 6. We see a marked break in the log-log fit for larger p, with a crossover from
$\gamma' \simeq -1$ to $\gamma' = -0.45 \pm 0.06$ at about $\ln(k_{\rm out}) = 2.5$. This
crossover behavior is just off-scale in Fig. 6, as can be seen in Fig. 7, with the
incipient scaling with a power of $-1$ again extending out to $\ln(k_{\rm out}) \simeq 2$.

FIG. 5: The log-log fit to the envelope of the out-degree distribution shown in Fig. 3.
The slopes of the dashed and continuous lines are -0.9 and -0.39, respectively.

It is also worthwhile mentioning that the integral over the rightmost peaks of the
out-degree distribution simply gives the network size N, as it should, because these peaks
come from the smallest genes of unit length, which are embedded in all the other genes.
Introducing a larger lower cutoff to the gene size (i.e., a cutoff bigger than unity)
would somewhat restrict the range of $k_{\rm out}$, but also yield cleaner scaling
behaviour for large values of $k_{\rm out}$ by effectively eliminating these peaks.

It should be noticed that for $k_{\rm out} < 100$, the dips in the degree distribution are
about evenly spaced on the logarithmic scale, with a scale factor of 1/2. We conjecture
that each individual peak corresponds to a particular gene length $\ell$. (This conjecture
is borne out by comparing the integrals under the peaks with $n_\ell$ for p = 0.1,
L = 5000, but the comparison is much more noisy for p = 0.05, L = 15000.) Increasing
$\ell$ by unity exactly halves the leading contribution to $k_{\rm out}$ from genes of
length $\ell$.

FIG. 6: The log-log fit to the envelope of the out-degree distribution for p = 0.05,
L = 15000 (Fig. 4). The fit is to a slope of -0.46.

FIG. 7: The initial peak of the out-degree distribution for p = 0.05 and L = 15000, and
the double logarithmic graph of the same, showing incipient power law behaviour with the
power -0.89.

To get a better grip on the degree distribution, we numerically integrated the curves in
Figs. 3 and 4. The result is shown for p = 0.05 in Fig. 8. We find that

$$\int^{k_{\rm out}} n(z)\,dz \sim \ln(k_{\rm out}) , \qquad (13)$$

or a very small power, $\sim k_{\rm out}^{\gamma+1}$, where numerically $\gamma + 1 < 0.1$, at
least for $k_{\rm out}$ that are not too large. Thus we are led to believe that the overall
out-degree distribution scales like

$$n(k_{\rm out}) \sim k_{\rm out}^{\gamma}\, f(k_{\rm out}/N) , \qquad (14)$$

with $\gamma = -1.0 \pm 0.1$. The scaling function $f(x) \sim$ const. for $x \ll 1$, and is
log-periodic for intermediate values of its argument. Scaling breaks down for
$x \sim O(1)$. Note that $\gamma$ is smaller in absolute value than that reported for real
gene regulation networks; thus evolution seems to have narrowed the distribution somewhat,
restricting the number of genes that may be affected by any one gene.
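
The link between a logarithmic cumulative distribution and an exponent close to $-1$ can be made explicit with a short worked step. Taking a pure power law $n(z) \sim z^{\gamma}$ and a lower limit of 1 (an assumption made only for this illustration),

```latex
\int_{1}^{k} z^{\gamma}\,dz \;=\; \frac{k^{\gamma+1}-1}{\gamma+1}
\;\xrightarrow[\;\gamma \to -1\;]{}\; \ln k ,
\qquad\text{since}\qquad
k^{\gamma+1} = e^{(\gamma+1)\ln k} \simeq 1 + (\gamma+1)\ln k .
```

A logarithmic fit and a fit to a very small power $k^{\gamma+1}$ are therefore hard to tell apart as long as $(\gamma+1)\ln k$ remains small.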

The in-degree distribution, which we display in Fig. 9, shows two peaks. The second can be
fit to a Gaussian, as shown in Fig. 10. The first peak is more skewed than a Poissonian,
but may be fitted reasonably well by a distribution of the form
$f(x) \sim (x - x_0)^{\delta} \exp[-\xi (x - x_0)]$, where $\delta = 2.2$ and $\xi = 0.15$
(Fig. 11).

The total degree distribution again displays a modulated structure, as can be seen in
Fig. 12.

We have already discussed that the network will behave trivially either for very small N
(all very large genes with very small probabilities for interaction) or for large p (i.e.,
p > 1/2), where most of the chromosome will be occupied by non-coding partitions (the
symbol 2) and many null genes. We have determined the threshold value, $p_c$, below which
the genes are too few and too long, such that the single giant cluster breaks up into more
than one component. The results are reported in Table I, for L ranging from 15000 to 1000,
and undirected edges. Although we have only a few points, the dependence (see Fig. 13)
fits a power law, $p_c \sim L^{-\alpha}$, with $\alpha \simeq 3/4$, within the range
explored. In the limit $L \to \infty$, $p_c \to 0$, but less fast than $1/L$.

On the other hand, our expectations that there is a reasonably large interval of p's
($p_c < p < 1/2$) where the results do not depend quantitatively on the precise value of p
are borne out, as can be seen from our calculations for the minimum average path length
for undirected edges.

FIG. 8: The wavy line is the cumulative out-degree distribution for p = 0.05, L = 15000.
The dashed line is a logarithmic fit, while the continuous one is a fit to a small power
of $k_{\rm out}$.

FIG. 9: The in-degree distribution for p = 0.1 and L = 5000. The in-degree distribution is
much narrower than the out-degree distribution and displays two peaks.

FIG. 10: The second peak of the in-degree distribution for p = 0.1 and L = 5000 can be fit
to a Gaussian.

FIG. 11: The first peak of the in-degree distribution, with the tail of the Gaussian
distribution subtracted. The fits are to a Poisson distribution (continuous line)
$\exp(-z) z^k / k!$ with $z = \bar{k}$ computed from the data points, and the function
$f(k) = 0.09\,(k - 10)^{2.2} \exp[-0.15\,(k - 10)]$ (dashed line).



FIG. 12: The total degree distribution for p = 0.05, L = 15000.



TABLE I: The "percolation threshold" $p_c$ for strings of different length L. The average
number of non-null genes at the percolation threshold is also reported.

    L        <N>     p_c
    15000    177     0.012
    5000     112     0.023
    1500     83      0.059
    1000     54      0.086

FIG. 13: Dependence of the "percolation threshold" on L (plotted against ln[L]). The power
law fit gives $p_c \sim L^{-3/4}$.

We argued in Section II that as long as there is a single giant cluster, the directed
minimum path length is identically 1. For undirected paths this is no longer true and we
must determine $\langle l_{\min} \rangle$ numerically. Nevertheless, we can show that
$l_{\min} \le 4$. Note that for $p > p_c$, there will be an abundance of short genes of
unit length, which will contain the symbol 0 or 1 with equal probability. These will have
outgoing edges to all the typical genes which have an admixture of 0s and 1s, so that most
such genes will certainly be linked by paths of length at most equal to 2. The "worst
case" is that of atypical genes consisting of all 1's or all 0's, which require at least
one intermediate typical gene to link up (11111 - 1 - 011100 - 0 - 00000), which gives
$\max(l_{\min}) = 4$. We have calculated $\langle l_{\min} \rangle$ for fixed p and
different L, considering undirected edges, and find that it depends at most very weakly on
N, the number of vertices. We find, e.g., for L ranging from 1650 to 15000 and p = 0.05,
that $\langle l_{\min} \rangle$ ranges only from 1.673 to 1.669 (Table II).
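
The undirected path length can be sketched with a breadth-first search over the symmetrized adjacency matrix. This is only an illustration under stated assumptions: the function name `average_min_path` is hypothetical, the average is taken over all pairs of vertices that are connected, and the example graph is the "worst case" chain quoted in the text, built with substring matching.

```python
from collections import deque

def average_min_path(w):
    """Mean shortest-path length over all connected vertex pairs, ignoring edge direction."""
    n = len(w)
    nbrs = [[j for j in range(n) if j != i and (w[i][j] or w[j][i])] for i in range(n)]
    total = pairs = 0
    for s in range(n):
        dist = {s: 0}
        queue = deque([s])
        while queue:                      # breadth-first search from s
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1            # vertices reached from s, excluding s itself
    return total / pairs if pairs else float("nan")

# The chain 11111 - 1 - 011100 - 0 - 00000 from the text
genes = ["11111", "1", "011100", "0", "00000"]
w = [[int(i != j and genes[i] in genes[j]) for j in range(len(genes))]
     for i in range(len(genes))]
print(average_min_path(w))
```

In this five-gene example the two uniform genes are four steps apart; in a large network the many typical genes provide much shorter routes for most pairs.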

Fixing the length L and varying p (Table III) shows again that as long as
$\langle l_{\min} \rangle$ is defined, i.e., just above the "percolation threshold", it is
already very close to the value it will have for larger p.

The clustering coefficients are markedly different for the in- and out-degree
distributions, as well as the total connectivity. We find, for p = 0.05 and L = 15000,
$\langle C_{\rm out} \rangle = 0.034$, $\langle C_{\rm in} \rangle = 0.648$ and
$\langle C \rangle = 0.534$. We have verified in the last case that this clustering
coefficient is reproduced by computing the average total degree per node and dividing by
the total number of nodes, i.e., $\langle C \rangle \simeq \langle k \rangle / N$. Thus,
the total connectivity behaves very much like in classical random graphs.

TABLE II: The average minimum path length for undirected edges, with p = 0.05, for
different L.

    L        <N>     <l_min>
    15000    717     1.669
    10000    479     1.670
    5000     239     1.670
    2500     120     1.670
    2000     95      1.669
    1750     83      1.672
    1650     79      1.673

TABLE III: The average minimum path length $\langle l_{\min} \rangle$ for undirected
edges, for different p and L. Note that $\langle l_{\min} \rangle$ varies very little over
the entire range from $p \simeq p_c$ to $p \simeq 1/2$.

    p        L        <N>     <l_min>
    0.013    15000    194     1.858
    0.014    15000    209     1.851
    0.016    15000    237     1.84
    0.018    15000    267     1.82
    0.02     15000    295     1.81
    0.05     15000    717     1.669
    0.1      10000    452     1.543
    0.48     1500     374     1.360



IV. ANALYTICAL RESULTS ON TOY MODEL 

In this section we would like to present a number of the 
results which may be obtained from a hierarchical model. 
In particular, we would like to exploit the transitivity 
property, to compute certain properties of a toy version 
of our original model. 

Let us consider a tree network, with a branching ratio b. Take all connections emanating
from nodes at the mth generation to the ones at the (m+1)st generation to be directed. If
one applies the transitivity rule, then this automatically generates further connections
that link directly a node at the mth generation to all the nodes below it on the same
branch of the tree. Then, the out-degree (we will drop the index "out" from now on) of a
node at the mth level will be

$$k(m) = \sum_{m'=0}^{M-m-1} b^{\,m'+1} = b\,\frac{b^{M-m} - 1}{b - 1} , \qquad (15)$$

where M is the total number of levels and $k(m) \sim b^{M-m}$ for large $M - m$, which we
may safely assume. The frequency of the nodes at the mth level is $b^m$. This gives, for
the out-degree distribution,

$$n(k) \sim \text{const.} \left(\frac{1}{k}\right) . \qquad (16)$$

We now introduce redundancies within the same level. Let a fraction r of the b downstream
neighbors emanating from any node be identical to each other, with $r = a^m$,
$b^{-1} < a < 1$ being some constant. The nodes that are identical within any given level
are connected to each other by two-way bonds within the sub-branch where they are located.
We will call this subset of nodes the clones at any given level. This means that the
out-degree of any one of these nodes is now $k' = (a^m b)\, b^{M-m}$, since each
interconnected node inherits all the downstream neighbors of the $a^m b$ clones to which
it is connected. The interconnections between nodes further downstream do not introduce
any further change in $k'(m)$, since all the nodes that they can connect are already
connected to the nodes at level m. The number of clone-nodes (not all identical to each
other, but only in groups of $a^m b$) is clearly $a^m b^m$. Then, the out-degree
distribution of the nodes belonging to the clone-sets becomes

$$n(k') = b^{-\gamma} \left(\frac{k'}{N}\right)^{\gamma} , \qquad (17)$$

where $N = b^M$ and with

$$\gamma = \frac{\ln a + \ln b}{\ln a - \ln b} . \qquad (18)$$

Note that under our assumptions above, $-1 < \gamma < 0$. For large enough k this then
becomes the leading contribution to the out-degree distribution, rather than
$\gamma = -1$.
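
The exponent of the clone contribution can be checked numerically from the two quantities stated in the text: the out-degree $k' = (a^m b)\,b^{M-m}$ and the number of clone-nodes $a^m b^m$ at level m. Eliminating m between them gives, under our reading of the garbled source, $\gamma = (\ln a + \ln b)/(\ln a - \ln b)$. The sketch below uses hypothetical function names and arbitrary illustrative parameter values.

```python
import math

def toy_exponent(a, b):
    """gamma = (ln a + ln b) / (ln a - ln b), for a clone fraction r = a**m with 1/b < a < 1."""
    return (math.log(a) + math.log(b)) / (math.log(a) - math.log(b))

def clone_levels(a, b, M):
    """(k', n) for each level m: out-degree a^m b^(M-m+1) and clone-node count (ab)^m."""
    return [(a ** m * b ** (M - m + 1), (a * b) ** m) for m in range(1, M)]

a, b, M = 0.8, 3, 12  # illustrative values only
points = clone_levels(a, b, M)
(k1, n1), (k2, n2) = points[0], points[-1]
# Slope of ln n against ln k' between the first and last levels
slope = (math.log(n2) - math.log(n1)) / (math.log(k2) - math.log(k1))
print(slope, toy_exponent(a, b))
```

Since both $\ln n$ and $\ln k'$ are linear in m, the log-log slope is the same between any two levels, and it approaches $-1$ as $a \to 1$ and $0$ as $a \to 1/b$.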



V. CONCLUSIONS AND DISCUSSION

The model we have presented for the scaling behaviour of gene interaction, more
specifically RNA interference, turns out to be a rich model which is worthwhile to
investigate in its own right, as well as providing a highly structured starting point on
which further evolutionary pressures could act.

We have checked that the statistical properties of the network are robust under random
point mutations. Thus, the scale free random network may be seen as the attractor for a
process starting from, say, a sequence of uniform genes, subjected to random point
mutations over a very long time period. The scale free steady state can be considered as
the outcome of a process of self-organization under random perturbations.

In conclusion it is worthwhile to note that in RNA interference, longer RNA chains are cut
up by the "Dicer" enzyme into smaller segments, siRNA [17], which can be conveniently
matched with complementary sequences in a larger number of mRNAs. This trick would change
the power of the out-degree distribution from that found here.

APPENDIX

We would like to compute the probabilities $P(\ell,\ell')$ in Eq. (7). For successive
$\nu = \ell' - \ell \ge 0$, it is convenient to define subsequences $g_j^{(\lambda)}$ of
$G_j$ of length $\ell$ with a shift $\lambda$, $0 \le \lambda \le \nu$, such that
$g_j^{(\lambda)} = \{x_{j,1+\lambda}, x_{j,2+\lambda}, \ldots, x_{j,\ell+\lambda}\}$. Then,
we can obtain $P(\ell,\ell')$ in terms of joint probabilities like
$\mathcal{P}(G_i \neq g_j^{(0)}, G_i = g_j^{(1)})$, etc. Thus,

$$P(\ell,\ell) = \mathcal{P}(G_i = g_j^{(0)})$$
$$P(\ell,\ell+1) = \mathcal{P}(G_i = g_j^{(0)}) + \mathcal{P}(G_i \neq g_j^{(0)}, G_i = g_j^{(1)})$$
$$\vdots$$
$$P(\ell,\ell+\nu) = \mathcal{P}(G_i = g_j^{(0)}) + \ldots + \mathcal{P}(G_i \neq g_j^{(0)}, G_i \neq g_j^{(1)}, \ldots, G_i = g_j^{(\nu)}) .$$

Let us display $\mathcal{P}(G_i \neq g_j^{(0)}, G_i = g_j^{(1)})$ for greater clarity. We
have

$$\mathcal{P}(G_i \neq g_j^{(0)}, G_i = g_j^{(1)}) = \mathcal{P}(G_i = g_j^{(1)}) - \mathcal{P}(G_i = g_j^{(0)}, G_i = g_j^{(1)}) = 2^{-\ell}\,[1 - 2^{-\ell}] ,$$

since a simultaneous match at both shifts requires all $\ell + 1$ symbols of $G_j$ to be
equal.


In particular we find

$$P(\ell,\ell+1) = 2^{-\ell}\,[1 + 1 - 2^{-\ell}] = 2^{-\ell}\,[2 - 2^{-\ell}]$$

P(l,l + 2) = 2~' [3 - 2- l+l - - (1 - 2- 2t ) 

3 

Note that the probabilities $\mathcal{P}$ do not depend upon the particular gene, except
through the lengths of the sequences, as the genes do not overlap, and each of the symbols
is independently and identically distributed, with equal probability, over 0 and 1.